SELFIES and the future of molecular string representations

Author: Ai Qianxiang
Barthel Senja
Carson Nessa
Frei Angelo
Frey Nathan C.
Friederich Pascal
Gaudin Théophile
Gayle Alberto Alexander
Krenn Mario
Moosavi Seyed Mohamad
Publication date: 01/01/2022

Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings. Most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
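The fragility of SMILES noted in the abstract can be illustrated with a toy experiment. The sketch below is not a SMILES parser and is not taken from the paper: the eight-symbol alphabet, the function names, and the two syntactic rules (balanced branch parentheses, paired ring-closure digits) are assumptions chosen for illustration. Because the rules are only necessary conditions, the fraction it reports is an upper bound on validity within this toy alphabet.

```python
import random

# Hypothetical toy alphabet; real SMILES has many more symbols.
ALPHABET = ["C", "N", "O", "=", "(", ")", "1", "2"]


def passes_toy_syntax_check(s: str) -> bool:
    """Check two necessary (not sufficient) SMILES-like syntax rules."""
    # The string must be non-empty and start with an atom symbol.
    if not s or not s[0].isalpha():
        return False
    depth = 0           # current branch nesting depth
    open_rings = set()  # ring-closure digits seen an odd number of times
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing a branch that was never opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: open on first sight, close on second
    return depth == 0 and not open_rings


def fraction_passing(n_samples: int, length: int, seed: int = 0) -> float:
    """Fraction of uniformly random strings that pass the toy check."""
    rng = random.Random(seed)
    hits = sum(
        passes_toy_syntax_check(
            "".join(rng.choice(ALPHABET) for _ in range(length))
        )
        for _ in range(n_samples)
    )
    return hits / n_samples


if __name__ == "__main__":
    print(passes_toy_syntax_check("C1CCCCC1"))  # cyclohexane-like string
    print(fraction_passing(n_samples=10_000, length=12))
```

A robust representation such as SELFIES is designed so that the analogous fraction is 1 by construction, which is the property the abstract calls 100% robustness.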

Institutional Repository of the Freie Universität Berlin

arXiv.org e-Print Archive (arXiv comment: 34 pages, 15 figures)

MPG.PuRe

KITopen

VU Research Portal

Proceedings - University of Groningen

ARTS repository - University of Groningen

PubMed Central

Dissertations of the University of Groningen

Note sur quelques empreintes végétales des terrains supérieurs de la Toscane

Author: Gaudin Charles Théophile
Publication venue: Blanchard

By Charles-Th. Gaudin. Offprint from: Bulletin de la Société vaudoise des sciences naturelles, 41/1857, pp. 1-1

e-rara.ch (ETH Zürich)

Prediction of Pourbaix diagrams of quinones for redox flow battery by COSMO-RS

Author: Aubry Jean-Marie
Gaudin Théophile
Publication venue: Elsevier
Publication date: 01/05/2022

Redox-flow batteries are relevant for storing energy from intermittent sources such as solar panels or wind turbines, thereby smoothing their energy supply. Up to now, most redox-flow batteries have been based on vanadium. Vanadium is a rare and expensive material, so recent research has focused on redox-flow batteries based on organic compounds and, in particular, on anthraquinones as electroactive materials. However, the tunability of organic chemistry poses a needle-in-a-haystack challenge, as the structures exhibiting the most desirable electrochemical properties may be hard to pinpoint. Moreover, the low water solubility of the most readily available anthraquinones may hinder their use as battery electrolytes. To aid in this endeavor, a theoretical approach is proposed to predict Pourbaix diagrams of redox-active organic compounds, allowing in silico anticipation of their electrochemical behavior over the entire pH range. DFT/COSMO-RS-predicted pKa values and reduction potentials are in good agreement with experimental data, and the resulting calculated Pourbaix diagrams also agree with four experimental ones from the literature, demonstrating the reliability of the method. Finally, the effect of the nature and position of some functional groups on the anthraquinone backbone is discussed, illustrating the power of the method to both understand and quantify the electrochemical activity of redox-active organic materials.
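To make concrete how pKa values and a reduction potential combine into a Pourbaix line, here is a minimal sketch based on the textbook Nernst treatment of a two-electron, two-proton quinone/hydroquinone couple. It is not the DFT/COSMO-RS workflow of the paper. It assumes 25 °C, equal total concentrations of the oxidized and reduced forms, no protonation of the oxidized quinone, and ideal behavior; the function name and the numerical inputs are hypothetical.

```python
import math

R = 8.314462618   # gas constant, J/(mol K)
F = 96485.33212   # Faraday constant, C/mol
T = 298.15        # temperature in K (assumed 25 degrees C)


def quinone_pourbaix_potential(ph: float, e0: float,
                               pka1: float, pka2: float) -> float:
    """Potential (V) of a Q + 2e- + 2H+ <-> QH2 couple at a given pH.

    e0 is the standard potential at pH 0; pka1 and pka2 are the two
    acidity constants of the reduced form (QH2 and QH-).
    """
    h = 10.0 ** (-ph)
    ka1 = 10.0 ** (-pka1)
    ka2 = 10.0 ** (-pka2)
    # Nernst factor per decade for a two-electron process, about 0.0296 V.
    factor = R * T * math.log(10) / (2 * F)
    # Sum over the three protonation states of the reduced form.
    return e0 + factor * math.log10(h * h + ka1 * h + ka1 * ka2)


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    for ph in range(0, 15, 2):
        e = quinone_pourbaix_potential(ph, e0=0.20, pka1=10.0, pka2=12.0)
        print(f"pH {ph:2d}: E = {e:+.3f} V")
```

Under these assumptions the line falls by about 59 mV per pH unit below pKa1, by about 30 mV per pH unit between pKa1 and pKa2, and becomes flat above pKa2, which is the characteristic three-segment shape of a quinone Pourbaix diagram.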

HAL-Artois

Mixture descriptors toward the development of Quantitative Structure-Property Relationship models for the flash points of organic mixtures

Author: Fayet Guillaume
Gaudin Théophile
Rotureau Patricia
Publication venue: American Chemical Society (ACS)
Publication date: 01/01/2015

Quantitative structure-property relationships (QSPRs) are increasingly used for the prediction of physicochemical properties of pure compounds, but only a few have been developed to predict the properties of mixtures. In this work, a series of existing and new formulas were proposed to derive mixture descriptors for the development of QSPR models for mixtures. These mixture descriptors were used to model the flash points of a series of 435 organic mixture compositions. Multilinear models were obtained using 12 different mathematical formulas, taking into account the linear or nonlinear dependence of the flash point on the concentration of each compound. The best model, derived from the newly proposed $(x_1 d_1 + x_2 d_2)^2$ formula, was a four-parameter model with good predictive capability (a mean absolute error in prediction of 10.3 °C) compared with existing predictive methods for both mixtures and pure compounds.
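As a worked illustration of the squared formula quoted in the abstract, the sketch below evaluates it for a binary mixture. The abstract does not define the symbols, so reading x1 and x2 as the mole fractions of the two components and d1 and d2 as the corresponding pure-compound descriptor values is an assumption, as are the function name and the example numbers.

```python
def squared_mixture_descriptor(x1: float, d1: float,
                               x2: float, d2: float) -> float:
    """Evaluate (x1*d1 + x2*d2)**2 for a binary mixture.

    Assumed meaning: x1, x2 are mole fractions (summing to 1) and
    d1, d2 are the same molecular descriptor for each pure compound.
    """
    if abs(x1 + x2 - 1.0) > 1e-9:
        raise ValueError("mole fractions of a binary mixture must sum to 1")
    return (x1 * d1 + x2 * d2) ** 2


if __name__ == "__main__":
    # Hypothetical descriptor values for two compounds, equimolar mixture.
    print(squared_mixture_descriptor(0.5, 2.0, 0.5, 4.0))  # 9.0
```

Squaring the mole-fraction-weighted sum makes the descriptor depend nonlinearly on composition, which matches the abstract's remark that the formulas account for linear or nonlinear concentration dependence.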

HAL-INERIS

FigShare

Combining mixing rules with QSPR models for pure chemicals to predict the flash points of binary organic liquid mixtures

Author: Fayet Guillaume
Gaudin Théophile
Rotureau Patricia
Publication venue: Elsevier BV
Publication date: 01/01/2015

Flash point is a key property of liquids for evaluating the safety of industrial processes. Mixing rules are commonly used to calculate the flash point of liquid mixtures, but they require the flash points of the pure compounds. Theoretical methods, notably those based on quantitative structure-property relationships (QSPR), already exist to predict the flash points of pure compounds. In this paper, the direct combination of these two types of approaches was therefore investigated to achieve predictions even when the flash points of the pure compounds are unknown. Three relevant mixing rules and four QSPR models, based on simple constitutional descriptors, were considered. Based on a data set of 284 experimental data points for binary mixtures extracted from the literature, two reliable combinations were highlighted. The most accurate one reached an error in prediction of only 2.9 °C but required the boiling point and Antoine coefficients of each component of the mixture. A new fully predictive method was also proposed, with a similarly low error in prediction (4.4 °C), requiring only the molecular structure of each pure compound and the molar fractions of the mixture. The errors of each predictive method remain quite reasonable compared with the expected accuracy of direct flash point measurements for binary mixtures.

HAL-INERIS

Estimating the flammability of liquid mixtures using QSPR models

Author: Fayet Guillaume
Gaudin Théophile
Rotureau Patricia
Publication venue: HAL CCSD
Publication date: 01/10/2017

HAL-INERIS